Goal of the exploratory analysis

Our team has been asked to carry out a descriptive study of the Seoul bike sharing service. To this end we use both descriptive and inferential statistics to look for correlations, trends and patterns among the variables of the ‘SeoulBike’ dataset.

Research questions

  • To what extent has the Seoul bike sharing service been a success since its start date?

  • How much of this evolution over time can be explained by the dataset’s variables?

Introducing the data

First of all, we import the ‘SeoulBike’ dataset, a .csv file, using the read.csv() function. The dataset records the number of bikes rented in Seoul from December 2017 to November 2018.

The dataset contains 8760 observations and 14 variables. Several variables are meteorological, among them the temperature in °C, the level of solar radiation and the wind speed.

The other variables are time-related. Each hour of each day has its own rented bike count, which gives 24 counts per day.

The dataset’s structure

Before digging into the data analysis, it is essential to convert the ‘Date’ variable to the Date class.
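The conversion can be sketched as follows. This is a minimal illustration on a toy data frame, not the report's actual code. It assumes the raw dates are day/month/year strings, and the values shown are for illustration only.

```r
# Toy data frame standing in for the imported CSV (illustrative values only).
bike <- data.frame(
  Date = c("01/12/2017", "02/12/2017", "30/11/2018"),
  Rented.Bike.Count = c(254L, 204L, 584L),
  stringsAsFactors = FALSE
)

# Assumption: dates are stored as day/month/year character strings.
bike$Date <- as.Date(bike$Date, format = "%d/%m/%Y")

class(bike$Date)            # "Date"
format(bike$Date, "%Y-%m")  # year-month labels, handy for monthly aggregation
```

If the file used another layout (for instance year-month-day), only the `format` string would need to change.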

Here is a summary of the types of variables included in the dataframe. There are both qualitative and quantitative variables.

variable class
Date Date
Rented.Bike.Count integer
Hour integer
Temperature..C. numeric
Humidity... integer
Wind.speed..m.s. numeric
Visibility..10m. integer
Dew.point.temperature..C. numeric
Solar.Radiation..MJ.m2. numeric
Rainfall.mm. numeric
Snowfall..cm. numeric
Seasons character
Holiday character
Functioning.Day character

Since the dataset contains a very large number of observations, we do not need to check the normality assumption before carrying out the statistical tests: by the central limit theorem, the sample means are approximately normally distributed.

To keep the code concise, we wrote a reusable t-test function whose alternative hypothesis is ‘greater’.
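A function of this kind could look like the sketch below. The name `t_test_greater`, its arguments and the layout of the returned table are hypothetical. The sketch also relies on the default of `t.test()`, which is the Welch version with unequal variances; the report does not say which variant was used.

```r
# Hypothetical helper: one-sided two-sample t-test returning a one-row summary.
# x is the group expected to have the larger mean under H1.
t_test_greater <- function(x, y, labels = c("x", "y")) {
  res <- t.test(x, y, alternative = "greater")  # Welch test by default
  out <- data.frame(
    mean_x = unname(res$estimate[1]),
    mean_y = unname(res$estimate[2]),
    p_value = res$p.value
  )
  names(out)[1:2] <- paste("mean in", labels)
  out
}

# Usage on simulated counts (not the SeoulBike data).
set.seed(1)
late  <- rpois(200, lambda = 700)
early <- rpois(200, lambda = 250)
t_test_greater(late, early, labels = c("late", "early"))
```

Returning a data frame rather than the raw test object makes it easy to print each result as a small table.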

Looking into the data, we found that no bikes were rented on non-functioning days. We assume the bike rental service is closed on these days, in the same way as banks or post offices, possibly for reasons linked to local customs.

To avoid distorting the results, we decided to delete the rows related to the non-functioning days.

Total Rented Bike Count
Functioning day 6,172,314
Non-functioning day 0
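The check and the filtering step above can be sketched in base R. The toy data frame is made up, and the coding of ‘Functioning.Day’ as "Yes"/"No" is an assumption.

```r
# Toy data (illustrative values only); "Yes"/"No" coding is assumed.
bike <- data.frame(
  Rented.Bike.Count = c(254L, 0L, 173L, 0L, 584L),
  Functioning.Day = c("Yes", "No", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Total rentals for each level of Functioning.Day
totals <- aggregate(Rented.Bike.Count ~ Functioning.Day, data = bike, FUN = sum)
totals

# Keep only the rows recorded on functioning days
bike_open <- bike[bike$Functioning.Day == "Yes", ]
```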

Correlation matrix

The correlation matrix indicates a strong correlation between ‘Dew.point.temperature..C.’ and ‘Temperature..C.’, which means the two variables are highly collinear.

Meanwhile, 5 correlations are not statistically significant. They appear as zeros in the matrix.

Here are the variables positively correlated with the number of rented bikes (in decreasing order):

  • Temperature
  • Hour
  • Solar Radiation
  • Wind Speed
  • Visibility

Conversely, these are the variables negatively correlated with the number of rented bikes (in decreasing order):

  • Humidity
  • Rainfall
  • Snowfall
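A correlation matrix in which non-significant coefficients are shown as zeros can be built along the following lines. This is a sketch on simulated data with made-up variable names. It assumes Pearson correlations and a 5% threshold, which the report does not state explicitly.

```r
set.seed(42)
n <- 500
temp <- rnorm(n, mean = 13, sd = 12)
dew <- temp - abs(rnorm(n, mean = 8, sd = 3))  # built to track temperature closely
humidity <- runif(n, 20, 95)
count <- pmax(0, round(700 + 25 * temp - 5 * humidity + rnorm(n, sd = 300)))
toy <- data.frame(count, temp, dew, humidity)

# Pearson correlation matrix, rounded for display
cor_mat <- round(cor(toy), 2)

# p-values of the pairwise correlation tests (diagonal left at 0)
vars <- names(toy)
p_mat <- matrix(0, length(vars), length(vars), dimnames = list(vars, vars))
for (i in seq_along(vars)) {
  for (j in seq_along(vars)) {
    if (i != j) p_mat[i, j] <- cor.test(toy[[i]], toy[[j]])$p.value
  }
}

# Non-significant coefficients (p > 0.05, assumed threshold) are displayed as 0
cor_shown <- cor_mat
cor_shown[p_mat > 0.05] <- 0
cor_shown
```

In this toy example `temp` and `dew` are strongly correlated by construction, which mimics the collinearity noted above.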

Overall aspects of the number of rented bikes

Statistics

This variable counts the number of bikes rented during each hour of each day between December 2017 and November 2018.

Descriptive statistics about the number of rented bikes per hour in Seoul
Min 25% Median 75% Max Mean Sd
Rented.Bike.Count 2 214 542 1 084 3 556 729.2 642.4

Looking at the table, we notice the wide range of the ‘Rented.Bike.Count’ variable, which means that demand for rental bikes fluctuated considerably over the period.

From this plot we can approximate the expected value of the hourly rented bike count using the empirical mean as estimator: \[\widehat{\mathbb{E}(X)} = \overline{X}\]

where \(X\) stands for the rented bike count variable. We find \(\overline{x} \approx 729\).

Note that the mean is not the midpoint of the distribution: the median is 542, so there is about a 50% chance that the hourly number of rented bikes is lower (resp. greater) than 542. The mean being larger than the median indicates a right-skewed distribution.

The bike sharing service’s evolution

On the following plot, the rented bike count is noticeably higher from March to October: the better the weather, the more people ride bikes.

Furthermore, an overall increase in the number of rented bikes is visible between the first (December 2017) and last (November 2018) months. We will test whether it is significant.

There might also be seasonality in the time series, possibly caused by the weather. With data over a longer period, we might observe variations in the rented bike count that recur at regular intervals: for instance, a regular increase from the end of spring to the beginning of autumn and a regular decrease from autumn to spring.

Has the Seoul bike sharing service been a success since its start date?

To answer this question we extract two samples from the ‘SeoulBike’ dataset: the first contains the data for December 2017, the first month of the dataset, which we take as the service’s start month, and the second contains the data for November 2018, the last month of the dataset.

Between the two months the rented bike count almost tripled. We will try to explain this large rise.

Overview of the first three rows of each sample
Date Rented.Bike.Count
2017-12-01 254
2017-12-01 204
2017-12-01 173
2018-11-01 584
2018-11-01 524
2018-11-01 362

Using the Student t-test function defined earlier, we compare the two samples’ means to check whether the number of rented bikes is higher in the second period than in the first.

We carry out the following t-test with a type I error risk of 5\(\%\).

\[\left\{ \begin{array}{ll} H_0 : & \mu_1 = \mu_2 \\ H_1 : & \mu_1 > \mu_2 \end{array} \right.\]

\(\mu_1\) stands for the expected value of the rented bike count in the second sample (November 2018) and \(\mu_2\) for that of the first sample (December 2017).

Comparison of the 2 samples with a Student test
mean in Nov 18 mean in Dec 17 p-value
Test results 718.7 249.1 7.858e-102

The t-test’s p-value being far below the 5% level, we reject \(H_0\): \(\mu_1\) is significantly higher than \(\mu_2\). In other words, the number of rented bikes has significantly increased since the start date.

The following plots depict this positive evolution. The daily count of rented bikes is plotted for both December 2017 and November 2018.

Computing the percent change between the two daily averages, we found that the daily average rented bike count increased by about 189%, that is to say it almost tripled over the period.
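As a check, the same order of magnitude is obtained from the two hourly means reported in the t-test table (718.7 and 249.1). This assumes that every day contributes the same number of hourly counts, so that the ratio of daily averages equals the ratio of hourly means:

\[\frac{\overline{x}_{\text{Nov18}} - \overline{x}_{\text{Dec17}}}{\overline{x}_{\text{Dec17}}} \times 100 = \frac{718.7 - 249.1}{249.1} \times 100 \approx 188.5\,\%\]

A rise of about 189% corresponds to a ratio of roughly 2.9 between the two means, hence "almost tripled".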

The next two parts will aim at finding relationships between the rented bike count variable and the other ones in order to explain the Seoul service’s success.

Influence of time variables on the number of rented bikes

Rented bike count per month

First, we aggregated the data by summing the number of bikes rented during each month.

We then separate the months into two classes according to whether their total is above or below the median monthly rented bike count.
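The aggregation and the median split can be sketched as follows. The daily counts are simulated, and the column names `month` and `class` are hypothetical.

```r
# Simulated daily totals over the study period (not the real data).
set.seed(7)
dates <- seq(as.Date("2017-12-01"), as.Date("2018-11-30"), by = "day")
daily <- data.frame(
  Date = dates,
  count = rpois(length(dates), lambda = 15000)
)

# Sum the rented bikes within each calendar month
daily$month <- format(daily$Date, "%Y-%m")
monthly <- aggregate(count ~ month, data = daily, FUN = sum)

# Two classes: above or below the median monthly total
monthly$class <- ifelse(monthly$count > median(monthly$count),
                        "Above median", "Below median")
monthly[order(-monthly$count), ]
```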

From the table, one notes a dichotomy between months. May, June, July, August, September and October stand apart, especially June, with a total of nearly 900 000 rented bikes. This may reflect the more favourable weather.

Conversely, winter months such as January, February and December do not do well: their combined total does not even reach half a million. How can this be explained? These are cold months. Moreover, the service was only set up in December and had not yet reached maturity.

Rented bike count per day

In this part we computed some statistical indicators about the daily number of rented bikes. What is striking in the following table is that the mean and the variability of the number of bikes rented each day rise together.

Statistical properties of the daily average number of rented bikes for each month
Month Mean sd Error bound
Jan 4838.90 1395.29 511.80
Feb 5422.61 1917.06 743.36
Mar 12277.23 4040.71 1482.15
Apr 18076.79 7563.76 2877.10
May 23569.60 8668.03 3236.70
Jun 29896.23 6226.19 2324.90
Jul 23692.26 7439.63 2728.88
Aug 21028.61 5174.10 1897.88
Sep 25908.15 6207.25 2507.16
Oct 23238.39 5867.61 2275.22
Nov 17248.70 5043.17 1995.01
Dec 5978.39 1943.16 712.76
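The three columns of the table can be reproduced with a small helper. The sketch below assumes that the error bound is the half-width of a 95% Student confidence interval for the mean, \(t_{0.975,\,n-1} \cdot s / \sqrt{n}\). This is our reading of the table and not something the report states: it is consistent with the January row if January has 31 functioning days. The helper name `summarise_daily` is hypothetical and the daily counts are simulated.

```r
# Hypothetical helper: mean, sd and 95% error bound of a numeric vector.
summarise_daily <- function(x, level = 0.95) {
  n <- length(x)
  s <- sd(x)
  bound <- qt(1 - (1 - level) / 2, df = n - 1) * s / sqrt(n)
  c(Mean = mean(x), sd = s, Error.bound = bound)
}

# Simulated daily totals for two months (not the real data).
set.seed(3)
daily <- data.frame(
  month = rep(c("Jan", "Jun"), each = 30),
  count = c(rnorm(30, 5000, 1400), rnorm(30, 30000, 6000))
)
stats <- t(sapply(split(daily$count, daily$month), summarise_daily))
round(stats, 2)

# Check against the January row of the table (sd = 1395.29, n = 31 assumed)
jan_bound <- qt(0.975, df = 30) * 1395.29 / sqrt(31)
jan_bound
```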

The following chart confirms our previous observations graphically. June is clearly above the other months, and the three winter months lag far behind the rest.

The widening of the confidence intervals illustrates that the more bikes are rented, the more the daily count fluctuates.

Hourly rented bike count

Going deeper into the study, we focus on the hourly rented bike count. We cut the days into 4 periods:

  • Night
  • Morning
  • Afternoon
  • Evening
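Such a split can be obtained with `cut()`. The report does not give the hour boundaries of the four periods, so the six-hour blocks below are purely hypothetical.

```r
hours <- 0:23

# Hypothetical boundaries: 0-5 Night, 6-11 Morning, 12-17 Afternoon, 18-23 Evening
period <- cut(
  hours,
  breaks = c(-1, 5, 11, 17, 23),
  labels = c("Night", "Morning", "Afternoon", "Evening")
)
table(period)
```

Applied to the ‘Hour’ column, the resulting factor can then be used as the grouping variable in the comparisons of means.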

At first sight, the number of users of the bike sharing service increases from 6 a.m. to 6 p.m., then decreases until it reaches its minimum at 4 a.m.

We first compare daytime and night-time rentals. The test results show a large and significant gap between the two periods’ means, with an extremely low p-value. Therefore, the null hypothesis of equal means is rejected.

Do people rent more bikes during daytime in Seoul?
mean in group DayTime mean in group NightTime p-value
Test results 817.6 624.5 6.174e-44

Another, perhaps less obvious, comparison was made: the number of bikes rented in the afternoon versus in the evening. The associated test led us to this conclusion: there is no significant difference between the two means.

Number of bikes rented in the afternoon versus in the evening
Afternoon Evening p-value
Test results 1 016 1 011 0.424

At this point, one question arises: how could the Seoul bike sharing service optimize its supply of bikes during the day? Given our observations and test results, it might be wise to prioritize the service from 7 a.m. to 10 p.m.

Impact of the holidays on the number of rented bikes in Seoul

Is the number of rented bikes influenced by the holidays?

To answer this question, we drew a boxplot of the rented bike count for each level of the two-level variable ‘Holiday’.

The median on holidays is about half the median on other days. Additionally, each “No Holiday” quartile is much higher than its holiday counterpart. This probably reflects a negative impact of the holidays.

Besides, the rented bike count is less spread out on holidays, whereas the highest values are reached on “No Holiday” days.

To check this impression, we tested whether the impact of holidays on bike rental is statistically significant by carrying out another test on means.

Is the impact of holidays on bike rental statistically significant?
mean in group No Holiday mean in group Holiday p-value
Test results 739.3 529.2 1.501e-12

The results are clear: significantly fewer bikes are rented on holidays.

Analysis of the meteorological variables of the dataset

Now that we have found patterns between the time variables and the rented bike count variable, it seems relevant to focus on the other part of the dataset. It’s time to use the weather-related variables to explain the evolution of the Seoul bike sharing service.

What is the weather like in Seoul?

Descriptive statistics about some meteorological variables
Min 25% Median 75% Max Mean Sd
Temperature..C. -17.8 3 13.5 22.7 39.4 12.77 12.1
Visibility..10m. 27 935 1 690 2 000 2 000 1 434 609.1
Solar.Radiation..MJ.m2. 0 0 0.01 0.93 3.52 0.5679 0.8682
Wind.speed..km.h. 0 3.24 5.4 8.28 26.64 6.213 3.723

The temperatures

As shown in both the density plot and the previous table, the temperature in Seoul fluctuates widely. From the plot we can divide the temperature distribution into two distinct groups: cold and warm temperatures.

This two-group pattern is a consequence of the city’s continental climate.

The following stacked density graph highlights the fact developed above. The Winter’s density is almost perfectly symmetrical to the Summer’s. The Spring’s and Autumn’s densities can be viewed as transition periods between the two opposite seasons.

The solar radiation

Seoul is not known to be a sunny place. On top of that, the solar radiation level does not fluctuate that much.

The wind speed

On average the wind speed in Seoul is about 6 km/h, which is much lower than the worldwide average we took as reference (~35 km/h). The low variability between seasons indicates Seoul is a city in which there is very little wind throughout the year. This may be a good point for bike rental.

How do meteorological variables influence the number of rented bikes in Seoul?

The evolution of rented bikes through seasons

Winter has the lowest median, and the number of rented bikes is less spread out than in the other seasons. The gap is so clear that there is no need for a test to verify that fewer bikes are rented during winter.

However, the boxplots for the three other seasons led us to conduct a one-way ANOVA to compare their means.

The test’s hypotheses are defined as follows:

\[\left\{ \begin{array}{ll} H_0 : & \mu_i = \mu\ ;\ \forall i =1,2,3 \\ H_1 : & \exists \ i \neq j \ | \ \mu_i \neq \mu_j \end{array} \right.\]

where \(\mu_i\) represents the rented bike count expected value for the season \(i\).

Anova test’s results
Test results
Df 2.00
Test statistic 110.63
p-value 0.00

As the p-value is less than the 0.05 significance level, we can conclude that the mean number of rented bikes differs significantly among the three seasons compared.
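A one-way ANOVA of this kind can be run with `aov()`. The sketch below uses simulated counts for three seasons, so its output will not match the table above. The variable names are hypothetical.

```r
# Simulated hourly counts for the three non-winter seasons (not the real data).
set.seed(11)
toy <- data.frame(
  Seasons = rep(c("Spring", "Summer", "Autumn"), each = 100),
  count = c(rnorm(100, 730, 600), rnorm(100, 1030, 650), rnorm(100, 920, 620))
)

fit <- aov(count ~ Seasons, data = toy)
res <- summary(fit)[[1]]   # ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F)
res

f_stat <- res[["F value"]][1]
p_val  <- res[["Pr(>F)"]][1]
```

With three groups the factor has 2 degrees of freedom, as in the results table.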

The impact of the temperatures

There is an inverted U-shaped relationship between the daily rented bike count and the daily average temperature. It implies there is an optimal temperature level which maximizes the number of rented bikes.

Using the two dashed vertical lines representing the median and third quartile of the daily average temperatures, we created a categorical variable that distinguishes the temperature levels:

  • ‘Low’: the average temperature is lower than the median of the average temperatures
  • ‘Medium’: the average temperature lies between the median and the third quartile of the average temperatures
  • ‘High’: the average temperature is higher than the third quartile of the average temperatures

Then we conducted a t-test to compare means between the levels ‘Medium’ and ‘High’.
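The three temperature levels can be derived from the quantiles with `cut()`. This is a sketch on simulated daily average temperatures. The names `temp_avg` and `temp_level` are hypothetical, and a value exactly equal to a threshold falls in the lower class here, a detail the report does not specify.

```r
# Simulated daily average temperatures (not the real data).
set.seed(5)
temp_avg <- rnorm(350, mean = 13, sd = 11)

# Median and third quartile used as thresholds
q <- quantile(temp_avg, probs = c(0.5, 0.75))

temp_level <- cut(
  temp_avg,
  breaks = c(-Inf, q[1], q[2], Inf),
  labels = c("Low", "Medium", "High")
)
table(temp_level)
```

By construction about half of the days are ‘Low’ and about a quarter fall in each of the two other levels.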

Does the daily number of rented bikes differ between temperature levels?
mean in group High mean in group Medium p-value
Test results 25 010 23 530 0.08812

As the p-value is higher than the 0.05 significance level, one may conclude there is no significant difference in the daily number of rented bikes between the ‘Medium’ and ‘High’ temperature levels.

In other words there is a similar pattern between the points located on both sides of the temperature which maximizes the number of rented bikes.

Nevertheless, the p-value is fairly low: for any \(\alpha\) > 0.09 we would reject the null hypothesis of equal means. In other words, we could no longer retain the hypothesis of equal means if we lowered the test’s confidence level, for instance to 90%.

The relationship between bike rental and the solar radiation level

There is an increasing linear relationship between the number of bikes rented per day and the average level of solar radiation.

Fitting a linear regression on these two variables, we find that an increase of 0.1 MJ/m² in the average level of solar radiation is associated with an increase of about 2300 rented bikes per day. This result has to be qualified since the average level of solar radiation is close to 0 and the variable fluctuates little.

Linear regression of the daily rented bike count on the average solar radiation
Term Estimate Sd T-statistic p-value
(Intercept) 4352.39 737.977 5.898 < 0.001
Sol_Rad_avg 23132.254 1136.06 20.362 < 0.001
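A regression of this form can be fitted with `lm()`. The sketch below uses simulated data, so the coefficients will differ from those in the table. With the estimate reported above, a 0.1 increase in solar radiation corresponds to 23132.254 × 0.1 ≈ 2313 additional rented bikes per day.

```r
# Simulated daily data (not the real data); coefficients chosen for illustration.
set.seed(9)
sol_rad_avg <- runif(300, 0, 1.4)
count <- 4000 + 23000 * sol_rad_avg + rnorm(300, sd = 5000)

fit <- lm(count ~ sol_rad_avg)
coef(summary(fit))   # Estimate, Std. Error, t value, Pr(>|t|)

# Effect of a 0.1 increase in the average solar radiation
slope <- unname(coef(fit)["sol_rad_avg"])
slope * 0.1
```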

The wind speed’s influence on the number of rented bikes

Although the curve plotted by R suggests an inverted U-shaped relationship between the daily rented bike count and the average wind speed, the point cloud is widely scattered.

A decreasing relationship can be noticed once the wind speed starts to be felt.

We conducted a t-test to check whether more bikes are rented when the average wind speed is low, i.e. lower than the median of the average wind speed.

Does the wind speed have an impact on bike rental?
mean in group Low mean in group High p-value
Test results 18 490 16 490 0.02952

As the test’s p-value is lower than the significance level \(\alpha\) = 5%, we reject the null hypothesis in favour of the ‘greater’ alternative. In other words, people rent more bikes when there is little wind.

Conclusion

Coming back to the questions we asked at the beginning of our analysis, the Seoul bike sharing service appears to be a success, as shown by the t-test on the two samples ‘Dec17’ and ‘Nov18’.

We also asked ourselves how the other variables influence the number of rented bikes.

After having split our case study into two parts, we have found the rented bike count depends both on time-related variables and meteorological variables.

As regards the temporal ones, the heavier use of the service during daytime and on non-holiday days suggests the Seoul bike sharing service is work-oriented.

There are also more rented bikes during sunny months, especially in June, to the point that the summer season stands out from the others. Indeed, the part on meteorological variables highlights an increasing relationship between the number of rented bikes and both the temperature and the solar radiation level.

The next part will aim at grouping similar variables in order to avoid overfitting in the models we will estimate. We will also classify days based on their features. To this end we shall apply PCA and CA techniques to the ‘SeoulBike’ data.